Establishing a Data Resource Centre

Experiences at the University of Guelph


Introduction: 

The following paper outlines the process of establishing a Data
Resource Centre (DRC). The paper documents the experiences at the
University of Guelph, where such a service was established from
scratch, and where gains have been made relatively quickly. Prior
to the fall of 1996 Guelph was in a situation similar to many
other research/teaching institutions. There were no formal
procedures in place for acquiring, distributing and analyzing
data in an electronic format. It was the responsibility of
individual faculty, researchers and students to develop the
necessary skills to make use of data. There was limited
statistical support, and overlap existed in acquiring data
resources. All of this resulted in duplication of effort with
respect to the use of electronic information on campus.

It  is  hoped  that  an  account  of  these  experiences  can  be  of 
use to others currently in the process of establishing a
DRC,  as  well  as  those  considering  undertaking  such  an 
endeavour.  To  that  end,  this  paper  is  written  in  an  easy  to 
follow  manner,  with  limited  technical  details.  Certain  goals 
and  objectives  were  set,  and  this  paper  looks  at  how  these 
goals  are  being  achieved  and  some  of  the  obstacles 
encountered.  Issues  such  as  motivation,  targeted  audience, 
teaching  needs,  research  needs,  levels  of  service,  staffing, 
hardware, software, security, and delivery tools will be
discussed. 

Establishing  a  data  resource  centre  (DRC)  can  be  a  very 
complicated  process.  For  the  people  involved  in  the  front 
line delivery and use of electronic information, the needs
and benefits have previously been laid out and shown to be
substantial. The challenge to the manager of a data centre
is  to  express  these  benefits  in  such  a  way  that  the 
administrators  who  control  funds  see  the  need  to  commit 
scarce  resources  to  this  type  of  service. 

Background: 

The  University  of  Guelph  is  a  major  research  institution  in 
Canada, with approximately $81 million in research
grants  per  year.  The  undergraduate  enrollment  is 
approximately  10,500  students  with  another  2,000+ 
graduate  students.  The  University  is  broken  up  into  six 
separate colleges including the College of Applied and
Human Sciences, Ontario Agricultural College, Ontario
Veterinary College, College of Biological Sciences, College
of Physical and Engineering Sciences and the College of Arts.
Recently several new remote campuses were added that deal
specifically with agriculture. There are also significant ties
with the Ontario Ministry of Agriculture, Food and Rural
Affairs (OMAFRA). The OMAFRA head office was recently
relocated to Guelph.

At  the  moment,  the  biggest  users  of  DRC  services  seem  to 
come  from  the  first  two  Colleges,  although  clients  are 
spread  amongst  all  groups.  The  nature  of  DRC  holdings 
and background of staff members has led to heavier use
from  departments  such  as  Economics,  Geography, 
Sociology,  Rural  Planning  and  Development,  Agricultural 
Economics  and  Consumer  Studies.  As  the  service  grows 
and new contacts are made, and a new population of users
is  developed,  it  is  expected  that  this  will  change.  It  has 
been  found  that  one  of  the  most  important  tasks,  and  a 
possible  problem,  is  informing  the  user  community  about 
what  is  being  done.  Experience  to  this  point  has  been  that 
once  contact  is  made  with  people,  and  there  is  an  actual 
demonstration  of  the  capabilities  of  the  DRC,  response  is 
extremely  positive. 

The process of establishing a DRC at the University of
Guelph  actually  began  in  the  fall  of  1993.  The  pilot  project 
did  not  begin  until  December  1996.  In  order  to  get  support 
for  the  project  a  detailed  proposal  was  written,  outlining  all 
of  the  possible  options  for  running  a  DRC.  A  great  deal  of 
this  information  was  developed  from  a  workshop  offered 
during the summer program at ICPSR. This was
augmented  with  tours  of  the  University  of  Toronto  Data 
Library  and  CHASS  facilities,  University  of  Western 
Ontario's  Social  Science  Data  Centre,  and  input  from  the 
Canadian  Association  of  Public  Data  Users  (CAPDU).  The 
proposal  was  very  detailed,  defining  the  users,  the  benefits, 
the  possible  levels  of  service,  considerations  for  a  suitable 
computing  environment,  the  departments  capable  of 
managing  the  service,  the  costs,  staffing,  and  what  other 
institutions  in  Canada  were  doing.  This  document  was 
used  as  background  information  to  justify  and  explain 
options  of  how  a  DRC  could  be  run. 

Until this point in time, individual faculty members had
been responsible for managing their own data needs. This
usually entailed hiring a research assistant and spending
time and resources getting these individuals "tooled-up" to
use whatever data they needed. The net result was that
there was frequent duplication of efforts, especially for
major  data  sets,  and  there  was  a  great  deal  of  frustration 
for  many  researchers. 

The  ideas  presented  in  the  initial  proposal  were  well 
received  by  various  groups  and  individuals  on  campus. 
Generally  the  response  was  that  such  a  service  had  been 
needed  for  a  long  time  and  would  be  of  use  and  welcome 
on  campus.  The  problem  was  to  not  only  find  the  resources 
to  get  this  project  to  run,  but  to  find  sufficient  resources  to 
make  it  function  properly.  After  a  long  period  of 
preparation  and  waiting  for  the  right  circumstances,  it  was 
clear  to  those  involved  in  the  project  that  initial  impressions 
of  the  functioning  service  must  be  positive.  If  the  service 
lacked  support  from  the  beginning,  and  did  not  manage  to 
impress  the  various  stakeholders,  it  would  be  very  difficult 
to attract and maintain the client base needed to develop a
commitment  in  the  long  term.  The  pilot  project,  it  was 
clear, was going to be an all-or-nothing situation.

It took approximately four years to move from the initial
ideas to the start of the actual pilot, during which there was
a great deal of lobbying done by the interested parties. In
the mid-1990s the University was developing a detailed
strategic plan and the need to address electronic
information was included as a small paragraph in the plan.
This was an important step, as it became clear that the idea
of a Data Centre was becoming central enough to the
University's plans that it was being discussed in various
high-level committees. Partly in response to this, a beta
version of the web retrieval system was developed over a few
days. This was extremely useful for presentations, and faculty
were able to clearly see the potential and the possible
applications of this system. With a working prototype in
place, it was much easier to generate enthusiasm about what
was being proposed, and a number of presentations to
potential users were very successful during this period. This
was also about the time that the Data Liberation Initiative
(DLI) was being established by Statistics Canada, with the
basic idea of making the data more available and universally
accessible. It was very clear that the University needed
something like a DRC to take advantage of the opportunities
that were being presented. At the same time there was a
movement at other institutions in Canada to establish data
centres. All of these factors helped to make the Data Centre
seem like a feasible, and particularly timely, project.

The initial proposal was for a collaborative effort between
the Library, Computing and Communications Services (CCS)
and the College of Social Science. The basic resources being
committed included the secondment of staff, infrastructure
money (which was very limited), and physical space to house
the centre. Central computing facilities such as a UNIX
system and software,
already  in  place,  were  also  used.  The  collaboration  between 
the  Library  and  Computing  Services  (The  College  of  Social 
Sciences  dropped  out  early  on)  was  very  important  in  that  it 
brought  together  a  diversity  of  skills  that  is  still  reflected  in 
our  current  staff.  Our  planning  had  always  taken  into 
account  that  useful  skills  could  be  drawn  from  the 
computing  and  library  fields,  and  that  input  from  the  user 
community  is  vital.  The  latter  includes  the  group  that  uses 
the  information,  be  it  researchers  or  teachers.  As  pointed 
out  by  Kroeker  (1997)  you  need  to  know  your  patrons  and 
what  their  needs  are. 

Figure 1 gives an example of the environment that existed
prior to the establishment of the DRC. The University can
be simplified by dividing everyone into four groups. Assume
students gain access through any of the four groups. These
four groups communicate in an ad-hoc fashion, as depicted by
the dashed lines. Note that there is no direct
communication between type A and type B
researchers/teachers. The distinction between the two groups
is that type B have direct access to incoming data. This may
be due to certain skills they possess, or their access to
resources needed to acquire this data. The problem lies in
the fact that data flows into either the library or type B
researchers/teachers. It is not clear where data will finally
reside and the information flow related to data is poor. This
is especially true between groups A and B.

Figure 2: Major functions of a data library

Collection Care
- identification
- acquisition
- storage
- verification
- documentation
- location
- indexation
- cataloging

User Services
- system file generation
- subsetting
- special purpose software
- general user documentation
- consultation
- orientation
- training (user and staff)

Archives
- cleaning
- inventorying
- archiving
- promoting national standards
- inter-institutional cooperation

Services 

A  paper  by  Jacobs  (1991)  gives  a  very  good  outline  of  the 
needs and ways to deliver data. Jacobs lists users'
expectations and the services associated with these
expectations, breaking them down into general library services,
reference services and computing services.

Early  on  in  the  proposal  stage  there  was  a  need  to  clearly 
define  the  services  that  were  going  to  be  offered  at  Guelph. 
Figure 2 outlines different levels of service associated with
a data library (see Ruus (1990)). Basically, the DRC was
prepared to undertake most aspects associated with
collection care and user services. However, there was no
commitment to archiving services. Over the first year this
decision was reconsidered. It has been found that there is a
demand for these services and in the summer of 1997 there
was a grant to hire a graduate student to begin archiving
historical census records from 19th century Canada. It is
believed that the DRC will continue to evolve in this
direction.

Another  interesting  development  that 
occurred  was  related  to  the  user 
community.  Initially  it  was  believed  that 
the  heaviest  users  would  be  faculty, 
researchers  and  graduate  students. 
However, applications related to teaching,
particularly in applied undergraduate
courses such as statistics and upper-year
research courses, have also made use of the
DRC.  This  tied  in  nicely  with  the  services 
being  provided  in  the  Government 
Documents  section  of  the  library,  where 
the  traditional  paper-based  statistical 
sources  are  housed,  and  where  many  of  the 
supporting  documents  that  DRC  users 
would  require  could  be  found.  It  was 
suspected  that  there  would  still  be  a  large 
number of 'one-off' type questions that
would  be  more  efficiently  answered  using 
the  traditional  hard-copy  sources.  For 
example,  questions  like  the  population  for 
a  given  CD,  CSD,  or  EA  might  best  be 
addressed  using  traditional  sources.  In 
these  cases  the  user  was  often  looking  for 
one  number,  or  even  just  a  few  numbers. 
However,  the  ease  and  flexibility  of  the 
web  retrieval  system  has  opened  up  the 
DRC  for  these  types  of  questions.  One  of 
the  challenges  has  been  deciding  when 
users  should  refer  to  the  DRC  in  order  to 
get  the  quickest  and  most  efficient 
response  and  when  they  should  refer  to 
traditional  sources. 


The DRC is centered around the WWW.
Expectations  are  that  users  will  become  self-sufficient  in 
finding  and  extracting  information.  This  is  similar  to  the 
objectives  of  other  Data  Centres  (see  Kroeker  (1997)).  The 
feeling  was  that  if  we  established  a  DRC,  demand  was  so 
high  that  we  could  easily  spend  all  of  our  human  resources 
answering  requests  without  expanding  the  information 
available.  Initially  the  DRC  did  not  have  any  public  hours 
for  walk-in  consultation,  and  once  it  did,  these  hours  were 
limited.  This  allowed  staff  to  get  a  jump  on  providing  self- 
help  information  for  users,  get  comfortable  with  the  DRC's 
services  themselves,  and  have  some  time  for  some  early 
fine-tuning of the service before declaring the service fully
functional.

IASSIST Quarterly

The  DRC  web  site  has  gone  through  several  iterations  since 
it  appeared  in  the  first  week  of  operations  having,  at  that 
point,  been  created  with  minimal  content  behind  it.  It  has 
developed  "on-the-fly"  in  response  to  demand,  feedback, 
and  experiences  in  a  live  web  environment,  with  the 
guiding  principle  being  that  web  sites  should  be  dynamic, 
and  regularly  edited  to  fit  the  changing  demands  or  the 
latest  ideas  of  users  or  staff.  Recently  there  was  a  major 
overhaul  to  simplify  the  site.  Most  of  the  changes  were  a 
direct  result  of  user  input,  and  studying  other  sites  on  the 
net. The main page links the user to 5 major areas:

The  first  area  takes  users  to  the  on-line  data  holdings, 
which  includes  access  to  the  web  retrieval  system 
(discussed later), CD-ROM products available over the net,
access to GIS data from the Census and any on-line
services that are subscribed to, such as CANSIM.

The  second  area  takes  the  user  to  information  and  links  to 
all the CD-ROM holdings. If they have not been made
available  over  the  net  then  there  are  instructions  on  how  to 
access  the  information. 

The third area links the user to information on data from
consortia agreements, which include ICPSR and DLI. There
are direct links to sites where the user can search holdings.
Only a small fraction of this data is stored locally; the rest
is usually obtained on a request basis.

Figure 3

The  fourth  area  deals  with  external  data  sources.  As  time 
permits  staff  gather  links  to  sites  that  provide  free  access  to 
data  (some  commercial  sites  are  also  linked)  and  logically 
order these links to help users find what they need. This is
also  a  valuable  resource  for  reference  staff  within  the  DRC. 
More  often  than  not  this  list  is  expanded  as  sites  are 
discovered  on  routine  reference  questions.  The  page  is 
divided  into  categories  such  as  Economics,  Agriculture, 
Science and others. The final area links the user to other
data  centres  and  data  related  sites  that  may  be  useful. 

The  office  of  the  DRC  is  located  adjacent  to  the 
Government  Documents  section  of  the  Library. 

Staff  in  the  Centre  have  workstations  and  desk  space  within 
this  office,  aside  from  the  Librarian  who  has  his  own  office 
in  this  area.  There  is  also  one  additional  workstation  that 
users  have  access  to,  if  needed.  The  idea  is  that  this  is  a 
point of reference within the library where users can
contact  staff  in  person  or  by  phone.  There  is  also  a  work 
area  to  hold  meetings  and  discuss  data  related  problems. 
There are also computer pools within the library to which
users can be directed to work on their problems, and
recently three more workstations were added just outside the
DRC that can be used by clients (they double as library
catalogue machines).


The web server for the DRC is split into two components. The
main portion runs on an NT server located in the DRC office.
This server stores data that come with a pre-written
PC/Windows interface. This server also stores CD-ROMs that
can't be loaded onto the WWW. There is a CD-ROM tower
attached to the server.

The bulk of the data holdings, in terms of both size and
number of data sets, resides on a central Unix server that is
used for running statistical applications. This server was
already in existence, and the DRC web retrieval system runs
along with other services. This makes it very convenient for
more experienced users to by-pass the web retrieval system and
work directly on the data with centrally maintained
software. This will be discussed more later.

Pilot  project 

Under  the  pilot  project,  a  format  similar  to  the  one  in  figure 
3  was  undertaken.  Essentially,  the  DRC  would  be  a  service 
offered  through  the  library,  and  as  such,  was  physically 
located  in  the  main  library.  There  was  direct 
communication  with  all  researchers,  CCS  and  the  library. 
All  data  would  be  channeled  through  the  DRC,  so  that 
everyone  was  aware  of  what  was  available  and  where  to 
find it. The Ashton Lab already existed, providing
advanced consultation with researchers dealing with data
collected in the field. Clients also have access to SAS/SPSS
help through Central Computing Services. The DRC
does not provide these services.

Staff  for  the  DRC  were  assigned  from  both  the  library  and 
CCS.  For  a  more  detailed  break-down  of  tasks,  refer  to 
Appendix  A.  Currently,  the  DRC  has  a  full-time  Systems 
Analyst  assigned  by  CCS  as  Project  Leader.  This  person  is 
essentially  responsible  for  coordinating  the  project  and 
participates  in  all  aspects  of  the  DRC,  including  interaction 
with  researchers  using  larger  data  sets  available  through  the 
DRC. CCS has also supplied a 0.5 FTE Systems Analyst,
whose main task is managing and writing the web retrieval
system. The library has assigned a 0.5 FTE Librarian who
has experience in Government Documents and working
with CD-ROMs. This person coordinates the DRC within
the library and handles all CD-ROM issues. The library
has also assigned a 0.5 FTE library associate with experience
in the Government Documents section. This person
functions as a reference person for clients coming into the
DRC, and helps to bridge the gap between the traditional
collection and the DRC collection. These assignments are
best-case situations. There are rarely 2.5 FTEs available in
any given week. On occasion funds are secured for
students to work on specific applications such as 1991
Census GIS files, Historical Census, and HIFE files.

Web  Retrieval  System 
The  System 

A large portion of the efforts in the DRC are centered
around the development of a web retrieval system. A Perl
script has been developed to provide a web-based interface
with SAS. This allows an enormous variety of data to be
easily mounted, distributed and analyzed on-line. In the 14
months  since  the  first  iteration  of  the  script,  over  200 
surveys  have  been  mounted  and  made  available. 

There  are  many  objectives  in  running  a  WWW  interface  to 
the  data.  The  interface  allows  simple  point  and  click  access 
to  a  variety  of  data  sets.  Users  are  able  to  select  a  subset  of 
variables, draw a sample based on conditions of certain
pre-defined variables, output to approximately 30 different
formats and perform simple statistics on their subsets. At
this  point  this  includes  frequencies,  crosstabs,  means  and 
simple  regressions.  However,  most  importantly  the 
interface  is  consistent  across  all  the  data  sets.  Several  data 
suppliers  are  developing  very  good  interfaces  to  their  own 
data.  One  of  the  problems  with  this  arrangement  is  that 
there  is  always  a  cost,  no  matter  how  intuitive,  to  learning  a 
new  interface.  Users  seem  to  appreciate  the  consistency 
and  the  ability  to  easily  move  from  one  data  set  to  another 
that  is  provided  with  our  single  web  interface. 
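
The retrieval steps described above (choosing variables, subsetting on conditions, then running frequencies, crosstabs and means) can be illustrated with a minimal Python sketch. It is not the DRC's Perl/SAS script; the sample records, variable names and function names below are all hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical extract; variable names and values are illustrative only.
SAMPLE = """province,sex,age_group,income
ON,F,25-34,41000
ON,M,35-44,52000
QC,F,25-34,38000
ON,F,35-44,47000
"""


def load(text):
    """Read delimited text into a list of records (dicts keyed by variable name)."""
    return list(csv.DictReader(io.StringIO(text)))


def subset(rows, keep, where):
    """Keep rows matching every condition in `where`, and only the variables in `keep`."""
    return [
        {var: row[var] for var in keep}
        for row in rows
        if all(row[var] in allowed for var, allowed in where.items())
    ]


def frequencies(rows, var):
    """One-way frequency table for a single variable."""
    return Counter(row[var] for row in rows)


def crosstab(rows, row_var, col_var):
    """Two-way table, keyed by (row value, column value)."""
    return Counter((row[row_var], row[col_var]) for row in rows)


def mean(rows, var):
    """Arithmetic mean of a numeric variable."""
    values = [float(row[var]) for row in rows]
    return sum(values) / len(values)


rows = load(SAMPLE)
ontario = subset(rows, keep=["sex", "age_group", "income"], where={"province": {"ON"}})
print(frequencies(ontario, "sex"))
print(crosstab(ontario, "sex", "age_group"))
print(mean(ontario, "income"))
```

Because every data set is reduced to the same record structure, the same four functions apply to any survey, which is the consistency the single interface is meant to provide.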

In addition to the above features the retrieval system also
gives users access to electronic codebooks, record layouts,
user guides, SAS contents files, and sample SAS and
SPSS  programs.  These  programs  can  be  transferred  to  the 
user's  own  PC  or  central  UNIX  account.  These  programs 
can  be  used  as  a  template  on  the  user's  own  UNIX  account 
to  run  and  read  the  data  as  defined  by  them.  The  system 
also  points  the  user  to  the  raw  data  files,  the  SAS  datasets, 
the  associated  format  and  index  files. 

All  of  the  data  is  stored  in  compressed  SAS  datasets  that 
are  fully  readable  by  anyone  with  a  central  UNIX  account. 
Variables  in  these  data  sets  normally  have  labels  and  many 
have  associated  value  statements.  The  degree  of 
completeness  depends  on  what  is  available  from  the 
supplier  of  the  data.  There  is  a  large  variety  in  DLI  and 
ICPSR  data.  Staff  are  currently  in  the  process  of  indexing 
the  larger  data  files  to  significantly  improve  retrieval  times. 
As  with  the  subset  variables  it  is  not  efficient  to  index  by 
every possible term, so an attempt is made to try and
choose  the  2  to  6  variables  that  satisfy  90%  of  the  requests. 
Currently  the  raw  data  file  is  also  stored  with  the  SAS  data 
set  for  users  who  prefer  to  write  their  own  programs.  Disk 
space  is  relatively  cheap,  but  there  may  be  constraints  in 
the  future. 
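
One way to approach the choice of index variables is sketched below. The paper does not say how the 2 to 6 variables are actually chosen, so this is only an assumption: it supposes a log of past requests is available, each recorded as the set of variables the user subset on, and greedily adds the most-used variables until the target share of requests is covered. The log contents and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical request log: each entry lists the variables one request subset on.
REQUEST_LOG = [
    ("province",), ("province", "year"), ("year",), ("province",),
    ("province", "year"), ("province",), ("year",), ("province",),
    ("province", "year"), ("occupation",),
]


def choose_index_variables(requests, target=0.9, max_vars=6):
    """Greedily pick the most-used subsetting variables.

    A request counts as covered when every variable it subsets on is indexed.
    Stops once `target` of the requests are covered or `max_vars` is reached.
    """
    chosen = []
    remaining = Counter(var for request in requests for var in request)
    while remaining and len(chosen) < max_vars:
        covered = sum(1 for request in requests if set(request) <= set(chosen))
        if covered / len(requests) >= target:
            break
        var, _count = remaining.most_common(1)[0]
        chosen.append(var)
        del remaining[var]
    return chosen


print(choose_index_variables(REQUEST_LOG))
```

In this made-up log, indexing on two variables already covers nine of the ten requests, so the rarely used third variable is left unindexed.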

The  Users 

Essentially  the  script  serves  two  types  of  clients.  The  first 
are  those  individuals  who  simply  want  a  table  or  even  a 
single  number  and  are  not  willing  to  wait  or  perform  a  very 
complex  procedure  to  get  output.  In  many  instances  it  may 
be  faster  to  look  up  the  table  or  number  in  a  hard  copy 
source.  It  must  be  kept  in  mind,  however,  that  the  web- 
retrieval  system  has  a  seemingly  infinite  number  of  custom 
tables  available,  whereas  in  hard  copy  the  number,  and 
nature  of  tables  made  available  is  decided  upon  by  the 
publisher. 

The  second  type  of  user  is  the  researcher,  who  is  looking 
for data on which to perform some analysis. Experiences
at  Guelph  suggest  that  many  researchers  have  a  problem 
when  dealing  with  empirical  work.  There  is  an  initial  cost 
to  getting  a  feel  for  the  data,  figuring  out  the  record  lay-out, 
writing  a  program  to  read  in  relevant  data,  and  then 
performing  some  simple  summary  statistics  on  the  data. 
Once this is done, and the data is sufficiently massaged into
a  format  they  can  use,  the  more  detailed  analysis  begins. 
The  DRC  concentrates  on  helping  with  the  first  part  of  this 
process.  The  web  interface  makes  it  extremely  easy  for 
anyone  to  go  in  and  'play  around'  with  the  data  before  they 
decide  what  they  want.  Essentially  there  are  now  decreased 
costs,  which  allows  more  time  for  the  more  complicated 
analysis,  as  well  as  expanding  what  the  user  may  attempt. 
Once  the  detailed  analysis  begins,  staff  forward  problems 
to  the  Data  Analysis  Support  group  or  the  Ashton 
Statistical  Laboratory. 

The  Administrator 

From  an  administrative  point  of  view  the  system  is 
extremely  simple  and  flexible.  One  script  handles  all  the 
different  formats  of  data.  When  a  data  set  is  added  to  the 
system  a  series  of  form  files  are  created  containing 
information  on  available  variables,  labels,  variables  to 
subset  by,  and  their  possible  values.  These  files  can  be 
easily  generated  from  the  SAS  program  and  contents  file 
using  a  variety  of  simple  editor  commands.  Once  created, 
information  such  as  directory  location,  data  name,  weights, 
whether  to  use  scroll,  input,  or  pull  downs,  and  how  to 
place  boxes  on  the  screen  is  quickly  added.  There  is  no 
modification  of  the  perl  script  necessary  to  deal  with  data 
sets.  Data  such  as  the  1992  FAMEX  survey  have  been 
added in as little as 1.5 hours. This includes downloading
the  data,  creating  the  SAS  dataset,  and  mounting  it  on  the 
web. 
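
The idea that one script can serve every data set, driven only by per-data-set description files, can be sketched as follows. This is a Python illustration rather than the DRC's Perl script, and the layout of the description (field names, widget names, the URL path) is assumed for the example; the actual form files are not documented here.

```python
from html import escape

# Hypothetical description of one data set. Every field name is an assumption.
EXAMPLE_SURVEY = {
    "name": "example_survey",
    "variables": {"hhsize": "Household size", "totexp": "Total expenditure"},
    "subset_by": {
        "province": {"widget": "pulldown", "values": ["ON", "QC", "BC"]},
        "tenure": {"widget": "scroll", "values": ["Owner", "Renter"]},
        "year": {"widget": "input", "values": []},
    },
}


def render_form(desc):
    """Build an HTML retrieval form from a data set description.

    Nothing in this function is specific to one survey: adding a data set
    means writing a new description, not changing the code.
    """
    out = [f'<form method="post" action="/retrieve/{escape(desc["name"])}">']
    # One checkbox per variable the user may keep in the extract.
    for var, label in desc["variables"].items():
        out.append(
            f'<label><input type="checkbox" name="keep" value="{escape(var)}"> '
            f"{escape(label)}</label>"
        )
    # One widget per subsetting variable: free-text input, scroll list, or pull-down.
    for var, spec in desc["subset_by"].items():
        if spec["widget"] == "input":
            out.append(f'<input type="text" name="{escape(var)}">')
            continue
        size = len(spec["values"]) if spec["widget"] == "scroll" else 1
        out.append(f'<select name="{escape(var)}" size="{size}">')
        out.extend(f"<option>{escape(value)}</option>" for value in spec["values"])
        out.append("</select>")
    out.append('<input type="submit" value="Retrieve">')
    out.append("</form>")
    return "\n".join(out)


print(render_form(EXAMPLE_SURVEY))
```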

In  terms  of  security,  access  to  the  system  is  protected  by  IP 
address  at  the  directory  level.  The  use  of  Apache  software 
to  drive  the  web  server  allows  us  to  place  .htaccess  files  in 
any directory to limit or give access to users. The perl
script  is  also  capable  of  checking  IP  addresses  based  on  the 
data  being  downloaded.  These  options  allow  us  to  open 
and  close  access  to  different  data  sets  as  the  need  arises. 
This  will  be  particularly  useful  when  we  move  to  a  shared 
system among Universities with differing licensing
arrangements. 
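
The per-data-set address check performed by the script might look something like the following Python sketch. The table of allowed networks, the data set names and the function name are hypothetical, and the address ranges are reserved documentation ranges rather than real campus networks.

```python
import ipaddress

# Hypothetical licence table: data set name -> networks allowed to retrieve it.
ALLOWED_NETWORKS = {
    "campus_only_survey": ["192.0.2.0/24"],
    "shared_survey": ["192.0.2.0/24", "198.51.100.0/24"],
}


def may_access(dataset, client_ip):
    """Return True if the client address falls inside a network licensed for the data set.

    Data sets missing from the table are closed to everyone.
    """
    address = ipaddress.ip_address(client_ip)
    networks = ALLOWED_NETWORKS.get(dataset, [])
    return any(address in ipaddress.ip_network(network) for network in networks)


print(may_access("campus_only_survey", "192.0.2.17"))
print(may_access("campus_only_survey", "198.51.100.5"))
print(may_access("shared_survey", "198.51.100.5"))
```

Keeping the check in a table keyed by data set means access can be opened or closed by editing one entry, which suits a system shared among institutions holding different licences.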

An  area  that  is  still  being  developed  is  the  search 
capabilities. This was intentionally avoided during the first
year  of  development  as  the  priority  was  to  give  access  to  as 
much data as possible. Recently, the ability to do searches
by  keyword  and  strings  was  enabled  on  the  contents  files. 
The  results  of  the  search  are  presented  in  a  tabular  format 
where the user is linked to the contents files and the
associated 'readme' file for that data set. Shortly there will
be  a  link  directly  to  the  web  retrieval  forms  for  this  data  set. 
The  files  that  are  searched  will  also  be  expanded. 
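
A keyword search over contents files of the kind described above can be sketched in a few lines of Python. The contents text shown is invented for the example, and the function is only an illustration of the approach, not the search actually implemented in the retrieval system.

```python
# Hypothetical contents listings, keyed by data set name.
CONTENTS = {
    "survey_a": "HHSIZE   Household size\nTOTEXP   Total expenditure\n",
    "survey_b": "PROV     Province of residence\nHHINC    Household income\n",
}


def search_contents(contents, keyword):
    """Return (data set, matching line) pairs; matching ignores case."""
    needle = keyword.lower()
    hits = []
    for dataset, text in contents.items():
        for line in text.splitlines():
            if needle in line.lower():
                hits.append((dataset, line.strip()))
    return hits


# Present the results as a simple two-column table.
for dataset, line in search_contents(CONTENTS, "household"):
    print(f"{dataset:<12}{line}")
```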

Integration  with  Library 

One  of  the  major  objectives  of  the  DRC  was  to  integrate 
data  identification  and  retrieval  services  with  other  services 
already  in  the  library.  The  web-based  nature  of  this  service 
means that every workstation in the library (and on
campus) is a potential contact point with the DRC, and
library  staff  will  be  faced  with  data  that  is  integrated  with 
the  other,  traditional,  library  services.  The  fact  that  many 
reference staff in the library have had little or no experience
with  data  means  that  training  will  be  a  time-consuming 
process,  and  developing  a  reasonable  level  of  comfort  with 
this  type  of  information  will  have  to  occur  gradually.  This 
is  an  entirely  new  resource  for  the  reference  staff  member 
to  consider,  and  education  will  involve  not  only  instruction 
in  statistics,  but  more  basically  the  understanding  of  the 
appropriate  uses  of  data.  The  obvious  place  to  start  training 
was  with  staff  in  the  Government  Documents  section, 
where  staff  had  experience  with  the  paper-based  version  of 
the data. There have been a few general sessions for library
staff  on  what  happens  in  the  DRC  and  what  data  is 
available,  with  more  detailed  sessions  planned  for  the 
future.  A  great  deal  of  emphasis  has  been  placed  on  the 
notion  that  we  are  trying  to  make  users  self-sufficient,  and 
to  minimize  the  need  for  consultation  for  relatively 
straightforward  queries.  Training  sessions  tend  to  consist 
of giving general outlines, asking staff to try to use the
DRC  web  pages,  and  then  to  come  back  to  us  with 
questions.    With  lots  of  hand-holding  this  seems  to  be 
working,  although,  as  is  often  the  case,  the  technology  itself 
rather  than  the  nature  of  the  data  causes  problems. 

The  feed-back  from  these  staff  members  is  extremely 
useful  in  helping  set  a  direction.  In  other  words,  we  get  the 
users  to  tell  us  what  works  and  what  doesn't  work. 
Training  will  slowly  move  into  more  detailed  and  specific 
sessions  as  services  are  better  defined.  Up  until  this  point 
the  DRC  has  been  evolving  rapidly  and  changes  are 
frequent.  The  general  view  is  that  there  will  always  be  a 
need for specialized assistance that probably will not reside
in the general reference staff of a library.

The  First  Year  -  Success  and  Areas  for  Improvement 

Overall  the  DRC  has  been  extremely  successful  and  in 
most  areas  we  have  progressed  well  beyond  expectations. 
There  was  a  lot  of  uncertainty  with  respect  to  how  we  were 
going  to  do  things,  but  everything  seemed  to  fall  into  place 
very  well.  This  can  be  attributed  to  a  few  things.  The  first 
was  the  level  of  technology  in  terms  of  software.  Perl, 
SAS, DBMS Copy and Apache delivered what was needed.

The  second  related  to  the  centralized  computing 
environment  that  existed  at  Guelph.  Although  there  were 
some  rough  spots  associated  with  system  security  and 
access  to  configuration  information  by  DRC  staff,  the 
resources  were  again  well  suited  for  what  was  needed. 
Delivering  the  data  from  a  centralized  UNIX  system, 
accessible  by  everyone  on  campus,  is  extremely  efficient. 

The  final  point  was  related  to  the  staff  involved.  As 
mentioned earlier, it is important to have individuals from
the computing fields, the library, and the user community
working together. The DRC was lucky to get a group of
people who not only technically complemented each other,
but  also  worked  very  well  together. 

One  of  the  biggest  surprises  has  been  how  easy  it  is  for 
administrators  to  add  data  to  the  retrieval  system.  The  way 
the Perl script has been written allows for data in a wide
variety  of  formats  to  be  easily  mounted  for  retrieval.  At 
best,  expectations  were  that  a  few  dozen  data  sets  could  be 
mounted  during  the  first  year  and  it  would  be  difficult  to 
determine  which  ones.  Currently  there  are  well  over  200 
data  sets  on  the  system  and  the  effort  necessary  to  mount 
these diminishes as staff comfort levels increase. It takes
between 1 hour and 2 days (for large multiple-file data sets)
to prepare data. As we begin moving back in time and
mounting 'old' data that lacks SAS/SPSS code and doesn't
have electronic codebooks, the process gets more difficult.
Even so, mounting data is easy enough that we are starting
to train library staff with limited SAS and UNIX skills to
do it.


Another surprise has been the speed of extraction. The
current system runs on a fairly slow UNIX server.
However, by indexing the data the response time can be
improved significantly. The functionality of the retrieval
system, in terms of output formats and summary statistics,
is also beyond expectations.

Some of the areas we still need to improve are related to
publicity. It is still difficult to get people to understand
what is being done in the DRC. As soon as we get one-on-one
contact, or demonstrate the system to a class, it becomes
easy, and the users become largely self-sufficient. A goal
is to get out to classes more often, and to make the DRC a
standard tool for the completion of assignments. We are
also continuing to publish a newsletter each semester
outlining developments and what people are doing.

The service is also not without its detractors. Some
researchers who are experienced with working on large
data sets, and using SAS or SPSS, do not always see the
benefits. It is very difficult getting these users to
understand that the structure of the whole system can help
even those who want to do their own extractions¹². In
some cases we spend more time with these experienced
users than with 'new' users who readily accept the system.

What is Next?

As mentioned above, we are progressing much faster than
we initially expected, but there are several things that still
need to be worked on. Some of these were mentioned in
section 7, related to publicity and staff training. Other
areas include increasing the functionality of the web
retrieval system, expanding on-line statistical options,
particularly related to graphing, and possibly interfacing
better with CIS systems. Work also needs to be done with
on-line keyword searching for information. The current
system works well if you know what data you are interested
in. One area in which much consultation is still needed,
however, is in the identification of data to suit a query. It is
hoped that an efficient search capability for the system may
also make the selection of data possible for the
inexperienced user. We have started in this direction and
hope to make progress over the summer.

Possibly the biggest project to date will be the sharing of
this resource with other institutions. The web-based system
is ideal for taking advantage of the potential efficiencies of
a joint service. Discussions are well under way to develop
this service into a seamless, shared resource between the
University of Guelph, University of Waterloo and Wilfrid
Laurier University. The end result will be a better service
for all parties involved. It is hoped that eventually other
such centres will appear in Canada and the workload and
overhead can be shared between even more institutions.


APPENDIX  A  -  brief  job  descriptions 

Project Leader - 1 FTE - Systems Analyst, CCS

Tasks:

Coordination of project; liaison
between CCS, Library and user community; participate in
management group; report to SAC; periodically report to
College IT Committees; liaison with outside groups such as
CAPDU, DLI, ICPSR, Statistics Canada; coordinate joint
ventures with WLU and Waterloo; control inflow of data
from outside sources, i.e. download from DLI and ICPSR;
participate in data purchase agreements and consortiums
(work with other data centers on national issues); provide
user support and consulting for staff, researchers and
students; participate in development of front end
applications; participate in production of newsletter and
annual report; assist other DRC staff as needed.

WWW  resource  person  -  .5  FTE  -  Systems  Analyst,  CCS 

Tasks: 

Incorporate,  develop  and  maintain 
web  retrieval  interfaces  for  electronic  data  resources; 
implement a process for controlling access to data;
implement a process for measuring usage; limited user
support and consulting services for end users; manage NT
server and various data products.

Librarian - .5 FTE - Library


Tasks: 


Overall coordination of "library specific"
side of project; user support and consultation; addition of
data to web site; coordinate CD-ROM products and
acquisition; establish and develop communication between
DRC and the rest of the library; participate in planning of
layout and design of service point; develop and participate
in training of library staff (classes and production
materials); work with library staff on publicity and
information; work with Acquisitions staff to bring data
acquisition in line with acquisition of other library
materials; work with Cataloguing staff to develop workable
method of cataloguing electronic data sets; keep up to date
with development of data resources available on the
Internet.

Library  Associate  -  .5  FTE  -  Carol  Perry  -  Library 

Tasks: 

Link  between  'hard-copy'  reference 
and  DRC  collection;  user  support  and  consultation;  adding 
data to web site; publicity (newsletter); backup strategies;
web  design  and  graphics;  WWW  searching  and  inventory 
(collect  resources,  data  sites,  and  useful  sources  of  related 
information);  participate  in  training  of  library  staff; 
maintain  hard  copy  collection  of  codebooks;  keep  statistics 
of patron traffic and data use.

¹ In this paper we define a DRC as a point of service where
users are able to get access to, and assistance with the use
of, electronic information such as the census, general social
survey, survey of consumer finances and so on. We do not
include resources such as electronic journals and books.
Experience suggests that users are frequently confused
about this distinction.

² It is felt that the benefits are well defined and understood,
and as such will not be discussed in detail. The emphasis
will be on the process rather than the justification.
However, it is essential to have a thorough understanding
of them; for more detail see Horne et al. (1996).
Lubanski (1996) gives some justification for increased
funding, highlighting the increases in empirical research.

³ The workshop was put on by Diane Geraci (SUNY
Binghamton), Chuck Humphrey (University of Alberta)
and Jim Jacobs (UC San Diego).

⁴ During a presentation of the web retrieval system at
another  University  there  was  a  very  positive  response  in 
this  direction.  The  faculty  member  was  elated  because  they 
could  now  sit  down  at  their  workstation  and  retrieve 
subsets  of  the  data  in  a  matter  of  minutes.  Up  to  this  point 
they  would  spend  valuable  resources  on  graduate  students 
and  many  times  they  ended  up  with  nothing  at  the  end  of 
the  project. 

⁵ The basic ideas for this were obtained from sample Perl
scripts and SAS programs written at Kansas. They were
available over the WWW. This was a very crude
implementation of an interface between SAS and the WWW.

⁶ See http://drc.uoguelph.ca

⁷ Many of Statistics Canada's data products now come in a
format that uses Ivation's Beyond 20/20 browser. We find
this a very useful interface for many queries, but stress it is
only a complement to our web retrieval system. It is not an
efficient format for dealing with many of the research-type
questions that are posed in the academic environment.

⁸ Staff decided early in the process to keep the retrieval
forms  as  simple  as  possible.  As  such,  each  data  set  has  a 
small  pre-determined  set  of  variables  to  subset  by.  In  the 
case  of  time  series,  this  would  include  months  and  years, 
whereas  in  the  case  of  most  cross-sections  something  like 
age,  region,  education  and  gender  are  chosen.  The 
objective  is  to  try  and  satisfy  the  most  possible  requests 
with  the  smallest,  most  commonly  used  subset  variables.  If 
a  user  determines  they  would  like  to  subset  by  some  other 
variable,  such  as  income,  this  can  easily  be  added.  The 
way  the  script  is  written  it  takes  about  2  minutes  to  add 
another  subset  variable.  The  necessary  'form'  file  is 
created  from  the  'values'  section  of  the  SAS  program. 
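
The idea in this note can be pictured with a short sketch. It is not the DRC's code, which was a Perl script driving SAS. It is a Python illustration under stated assumptions: the data set name `labour_survey`, the sample rows, and the helper names `add_subset_variable` and `extract` are all hypothetical. Each data set carries a short list of variables the form may subset by, and offering one more variable is a one-line change to that list.

```python
import csv
import io

# Hypothetical configuration: the small, pre-determined set of subset
# variables offered on the retrieval form for each data set.
SUBSET_VARIABLES = {
    "labour_survey": ["age", "region", "education", "gender"],
}


def add_subset_variable(dataset, variable):
    """Offer one more variable on the form, e.g. after a user request."""
    if variable not in SUBSET_VARIABLES[dataset]:
        SUBSET_VARIABLES[dataset].append(variable)


def extract(dataset, rows, selections):
    """Return rows matching every selection; reject variables not offered."""
    allowed = SUBSET_VARIABLES[dataset]
    for name in selections:
        if name not in allowed:
            raise ValueError(f"{name!r} is not a subset variable for {dataset}")
    return [
        row
        for row in rows
        if all(row[name] in values for name, values in selections.items())
    ]


# Illustrative records only; these are not real survey data.
SAMPLE = """age,region,education,gender,income
25-34,Ontario,university,F,40000
35-44,Quebec,college,M,52000
25-34,Ontario,college,M,31000
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
ontario_young = extract(
    "labour_survey", rows, {"region": {"Ontario"}, "age": {"25-34"}}
)
```

Keeping the list short keeps the form simple for most requests, while a variable such as income can be added with a single call to `add_subset_variable` when a user asks for it.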

⁹ Examples include: SAS, SPSS, SPSS portable, Gauss,
STATA, Lotus, Excel, Quattro, dBase, ASCII and many
others.

¹⁰ These files are used to easily access information on
variable names, labels and formats, as well as sample size.

¹¹ This researcher could be very experienced and not need
much assistance or they could be a student in an applied
course who has never worked with this type of data before.
In the latter case the web retrieval system allows the user to
easily  (and  painlessly)  obtain  the  data  and  move  on  to  more 
important  tasks,  central  to  the  course. 

¹² For example, codebooks, user guides, record layouts,
SAS  and  SPSS  code  ready  to  read  the  data  and  even  the 
data  already  in  SAS  data  sets  with  formats  and  labels  are  all 
available  through  the  web  site. 

* Paper presented at IASSIST 1998, Yale University, May
1998. Bo Wandschneider, Systems Analyst, Data Resource
Centre, University of Guelph; Doug Horne, Librarian, Data
Resource Centre, University of Guelph.

Global Access to Data Resources:

Where's  the  Metadata? 


Howe  and  Graham  (1993)  proposed  that 
"the  goal  for  the  use  of  metadata  and  the 
development  of  user  interfaces  should  be 
nothing  less  than  permitting  everyone 
from  the  novice  to  the  expert  to  function 
independently  at  a  desktop  machine." 
They identified three problems that
needed to be addressed:

•  Metadata must be transportable from platform to
platform.

•  There will be pressure on interface designers to make
interfaces ever smarter, as more and more naive users
access metadata.

•  Metadata will vary in quality, depending largely upon
whether the research team intended the study to be
available for secondary analysis.

The purpose of this paper is to reassess where the social
science community is with respect to the above issues.
Throughout this paper, we will be careful to distinguish
between studies intended for use in secondary analysis and
other studies. One of our themes is that tremendous
progress has been made over the past five years with
respect to data sets intended for use by secondary analysts.
In contrast, very little progress has been made with respect
to the problem of making metadata available for the tens of
thousands of other studies published each year in the social
sciences.

Transportability

Of the three issues identified by Howe and Graham (1993),
the greatest amount of progress has been made in terms of
the transportability of metadata (and data). This is not to
say that the problems have all been resolved, but it is now
possible to imagine a future in which transportability is a
non-issue. While researchers have been able to routinely
transmit error-free data around the world for the past six or
seven years, it has only been in the past two years or so that
the problem of data storage has been solved.

The University of Cincinnati has recently purchased an HP
330FX Optical Storage Jukebox to store its social science
data collection. The Jukebox, with 330GB of direct online
storage, is connected to a Windows NT file server that [...]
It is also easy to forget the fact that the recent success of
Java promises that access to metadata via the Web can, in
theory, be unfettered by operating system or platform
differences.

The UC system compares very favorably
to  almost  any  other  data  archive  in  terms  of  access  to 
secondary  data.  However,  as  these  kinds  of  storage  devices 
become  more  commonplace,  there  have  been  no 
comparable  improvements  in  the  quality  of  user  interfaces 
to  access  data  sets.  With  few  exceptions,  user  interfaces 
have  not  progressed  appreciably  in  the  last  five  years.  The 
University  of  Michigan  has  developed  impressive  web  sites 
for  analysis  and  extraction  of  data  from  the  General  Social 
Survey  and  the  American  National  Election  Study.  The 
Bureau of Economic Analysis has marginally improved user
access  to  the  Regional  Economic  Information  System 
(REIS)  CD-ROM  with  the  release  of  a  Windows  interface. 
Unfortunately, the Bureau of the Census GO/Extract
combination  and  the  National  Center  for  Health  Statistics 
SETS  software  have  remained  essentially  unchanged  over 
the  last  several  years. 

ICPSR  also  seems  to  be  staking  out  a  position  that  is 
distinct  from  that  of  industry.  The  ICPSR  Data 
Documentation  Initiative  is  moving  in  a  direction  away 
from  that  of  software  developers  such  as  Microsoft  and 
SAS  (Microsoft  and  SAS  are  both  members  of  the  Meta 
Data  Interchange  Specification  Initiative). 

While  there  have  been  modest  advances  in  the  ways  that 
statistical  software  packages  such  as  SAS  and  SPSS  permit 
the  analyst  to  make  use  of  metadata  in  working  with  a  set 
of  data,  packages  have  by  and  large  remained  stagnant  in 
the  amount  and  types  of  metadata  they  support.  Most 
packages  do  a  very  poor  job  of  supporting  any  type  of 
metadata  beyond  what  can  be  considered  "data  definition 
metadata"  (i.e.,  variable  labels,  value  labels,  missing  value 
definitions,  etc.),  and  even  with  respect  to  these  kinds  of 
metadata,  the  packages'  capabilities  are  nearly  identical  to 
what was available a decade ago, although more of this
information is available in point-and-click interfaces.

Perhaps most importantly, none of the major packages have
produced any revolutionary new tools for capturing
metadata that archivists will need for bibliographic
purposes or that secondary analysts will need for planning
their work. Just one illustration of the type of metadata
sorely needed but impossible to capture in these packages
is information about skip patterns. On the one hand, it must
be  acknowledged  that  software  package  designers  must  feel 
frustrated  at  the  lack  of  standards  for  metadata  in  the  user 
community.  On  the  other  hand,  both  SPSS  and  SAS  did  at 
one  time  pace  the  user  community  in  terms  of  promoting 
better  and  better  data  definition  features. 

Variability  in  Metadata  Quality 

As  just  noted,  producing  metadata  for  a  set  of  data  in  1998 
is  not  remarkably  different  than  in  1968:  someone  involved 
in  the  process  of  research  data  management  has  to  do  a  lot 
of  typing.  As  a  result,  the  metadata  available  for  a  study 
varies  tremendously  in  quality,  ranging  from  very  good  for 
large,  government-sponsored  efforts  such  as  the  census  to 
very poor for the student who has never been taught the
fundamentals  of  research  data  management. 

There is, thus, a sharp distinction at present between the
accessibility of data resources designed for secondary
analyses and virtually all other ones. Data sets collected
and prepared for the user community as secondary
resources are increasingly available via the Web and are
slowly becoming more and more accessible to end users as
the social science community learns what constitutes a
useful interface. As metadata standards become better
established and cataloging tools become more
sophisticated,  we  can  expect  the  pace  at  which  these  studies 
are  made  available  to  accelerate.  Ironically,  the  pace  at 
which  we  are  losing  primary  research  data  is  probably 
increasing.  More  and  more  research  is  published,  and  we 
would  guess  that  smaller  percentages  of  it  are  being 
archived. 


The Future

Our common goal should be nothing less than to create
metadata and user interfaces that allow the community of
data users to access and process secondary data. Metadata
standards, although varied and at times painfully subject
specific, have emerged. Our most popular data
management and analysis applications, however, continue
to lag in meeting the needs of the social science
community.

Our recommendation for solving the user interface problem
is unchanged from five years ago. We suggest an
interactive program shell that allows both the researcher
and the end user to:

•  Enter information that documents the bibliographic
record of the study, including study title, principal
investigator, year, funding source, related studies.

•  Create topical files that detail study information -
including topics such as sampling, copies of
instruments, relationships between study data sets,
calculation of weights and standard errors, definition of
terms, documentation of calculated variables or fields,
and originating hardware and software platforms.

•  Develop data definition and data manipulation
structure - including definition of elements and element
formats, complete labeling information, descriptive
statistics, and free-field explanatory notes.

A well-developed system would allow the researcher or
other person responsible for data documentation to either
create a default minimal metadata collection that would
facilitate subsequent file access, or create very
detailed documentation with all study specifics stored as
part of the metadata collection.

We also need to work harder at promulgating research data
management standards and encourage professional
associations, journal editors and funding agencies to require
the archiving of research data.
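
The three kinds of information such a shell would collect in this section (a bibliographic record, topical study files, and data definitions) can be pictured as one record. The Python sketch below is an illustration only and does not describe any existing package or standard; every class, field and function name in it is hypothetical, and the sample study is invented. It shows a default minimal record that can later be filled in with detail, written out as plain-text JSON so that it is not tied to one platform.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Bibliographic:
    """Bibliographic record of the study (hypothetical field names)."""
    title: str
    principal_investigator: str
    year: int
    funding_source: str = ""
    related_studies: list = field(default_factory=list)


@dataclass
class VariableDefinition:
    """Data definition metadata for one variable."""
    name: str
    label: str = ""
    value_labels: dict = field(default_factory=dict)
    missing_values: list = field(default_factory=list)
    notes: str = ""  # free-field explanatory notes


@dataclass
class StudyMetadata:
    bibliographic: Bibliographic
    topical_files: dict = field(default_factory=dict)  # topic -> text
    variables: list = field(default_factory=list)

    def to_json(self):
        """Serialize to plain text so the record can move between platforms."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def minimal_record(title, investigator, year, variable_names):
    """Build the default minimal collection: identification plus variable names."""
    return StudyMetadata(
        bibliographic=Bibliographic(title, investigator, year),
        variables=[VariableDefinition(name) for name in variable_names],
    )


# Invented example values, for illustration only.
record = minimal_record("Example Study", "A. Researcher", 1998, ["age", "region"])
record.topical_files["sampling"] = "Stratified random sample (illustrative text)."
text = record.to_json()
```

Starting from `minimal_record` and adding topical files or variable labels later mirrors the choice between a default minimal collection and very detailed documentation.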



IASSIST - CAPDU


17-21  MAY  1999 
TORONTO  Canada 


The  International  Association  for  Social  Science 
Information Service and Technology (IASSIST) and
the Canadian Association of Public Data Users (CAPDU)
announce their joint 1999 conference, "Building bridges,
breaking barriers: the future of data in the global
network".

The  conference  will  be  held  May  16-21,  1999  on  the 
University  of  Toronto  campus  in  Toronto,  Ontario  and 
will  address  issues  of  computing  and  information 
services  in  social  science  research,  teaching,  and 
data  management. 

This is IASSIST's 25th annual conference, and the ninth
CAPDU conference.

Protecting Confidentiality in Archival Data Resources


Data sharing is a disputed norm in scientific affairs
(Fienberg et al. 1985; Weil and Hollander 1990; Fienberg
1994; Mishkin 1995). On the one hand, principal
investigators argue that they and their research teams are
the most competent analysts of originally collected data and
best able to safeguard the data against release of confidential
information. They know the details and nuances of the
sampling  procedures,  instrumentation,  data  reduction,  and 
missing  data.  They  have  an  investment  in  the  original 
research  that  should  be  repaid  by  first  rights  of  publication. 
They  also  argue  that  for  certain  kinds  of  complex  studies, 
for  example,  observational  research,  organizational 
research,  longitudinal  research,  clinical  research,  and 
research  involving  geo-coded  data  or  administrative  records 
linked  to  survey  data,  they  are  the  only  or  principal 
safeguard  against  violations  of  confidentiality  of  the  data. 
On  the  other  hand,  researchers  argue  that  publicly 
supported  data  collections  should  be  available  to  the  public, 
or  at  least  to  competent  researchers.  Data  sets  can  be 
purged  or  cleaned  of  identifying  information.  Competent 
researchers  can  do  responsible  secondary  analyses  of  the 
data  while  simultaneously  upholding  the  normative 
requirements  for  protection  of  confidentiality.  The 
investment of public funds in data supersedes ownership
rights  at  least  with  respect  to  access  to  the  data,  as  also  do 
the  norms  of  science  as  an  activity  open  to  and  dependent 
upon  the  scrutiny  and  review  of  other  scientists. 

Since  1962,  ICPSR  has  been  responsible  for  many  of  the 
technical  and  normative  developments  in  social  science 
data  sharing.  As  an  archive  that  acquires  data  from  many 
principal investigators, ICPSR has had to develop and
implement  procedures  that  assure  original  investigators  that 
the  distribution  of  their  data  will  not  compromise  the 
protection  of  confidentiality.  As  an  archive  that  distributes 
data  to  a  wide  variety  of  users,  ICPSR  has  had  to  develop 
and  implement  these  same  procedures  to  substantially 
reduce  or  eliminate  the  opportunity  for  secondary  users  to 
compromise  confidentiality  even  if  they  wanted  to.  Over 
the  past  36  years,  ICPSR  has  had  to  respond  to  new 
technical challenges in protecting the confidentiality of
data,  while  simultaneously  charting  a  course  that  satisfies 
both  proponents  and  opponents  of  data  sharing,  both  data 
producers  and  data  users. 


by Christopher S. Dunn & Erik W. Austin¹

In this paper, we briefly review the origins of ethical
requirements and regulations for the protection of
confidentiality of research data and ways that
confidentiality can be violated. That discussion sets the
stage for a description of the nature and development of
ICPSR practices to assure confidentiality of research data.
These practices have had
to  take  account  of  both  technical  developments  in  the 
capacity  to  store,  distribute  and  analyze  data  and  normative 
developments  in  the  biomedical  and  social  sciences  about 
data  sharing.  Finally,  we  describe  some  trends  in  research 
that  pose  yet  new  problems  for  protecting  confidentiality  of 
research  data  and  some  new  approaches  to  protecting 
confidentiality. 

Ethics  and  Regulations 

Biomedical  sources. 

Surprisingly,  a  review  of  the  foundational  documents  that 
raised  the  consciousness  about,  and  led  to  Federal 
regulation  of,  the  protection  of  human  subjects  in  research 
revealed  very  little  attention  to  or  concern  with  the  privacy 
of research data and the protection of confidentiality. The
Nuremberg  Code  (OPRR  1993c)  addressed  informed 
consent,  social  benefits  of  research,  avoidance  of  suffering 
and  injury,  risks  to  subjects  not  greater  than  the  importance 
of  the  problem,  and  preparations  and  facilities  for 
protection  of  subjects  against  injury,  disability  and  death. 
But it did not address issues of confidentiality and privacy.
Beecher's seminal publications (1966a, 1966b) focused
primarily  on  safeguarding  the  physical  health  of  research 
subjects,  the  absence  of  voluntary  participation,  and  the 
need  for  informed  consent.  The  Belmont  Report's  (OPRR 
1993a)  discussion  of  three  basic  ethical  principles  (respect 
for  persons,  beneficence,  and  justice)  did  not  mention 
safeguarding  privacy  of  research  subjects  or  protecting  the 
confidentiality  of  data  obtained  from  them.  The  closest  it 
came  was  in  describing  the  principle  of  beneficence  as 
making  efforts  to  secure  the  well  being  of  persons  through 
minimizing  possible  harms.  Of  the  basic  documents,  only 
the  Helsinki  Declaration  mentioned  privacy:  "Every 
precaution  should  be  taken  to  respect  the  privacy  of  the 
subject  and  to  minimize  the  impact  of  the  study  on  the  ... 
subject."  (OPRR  1993b)  But  it  did  not  extend  this 
discussion  of  principles  to  its  practical  implication  for 
protecting confidentiality. Finally, in the current Federal
regulations governing human subjects protection,
confidentiality is mentioned only as an element of content
of an informed consent statement: "... in seeking informed
consent, the following information shall be provided to
each subject: ... (5) a statement describing the extent, if
any, to which confidentiality of records identifying the
subject will be maintained;" [45 CFR 46.116(a)(5)].

It  seems  likely  that  these  foundational  documents  largely 
ignored  privacy  and  consent  issues  because  of  their 
biomedical  research  origins,  their  primary  concern  with 
protection  of  the  physical  health  and  well  being  of  the 
subjects,  and  the  (false)  assumption  that  physicians  would 
be  the  primary  personnel  conducting  biomedical  research 
with  people.  Under  these  conditions,  research  information 
from  or  about  human  subjects  is  equated  with  information 
obtained  under  the  privacy  and  confidentiality  of  the 
physician-patient privilege in a clinical relationship. So
apparently  little  if  anything  was  said  about  privacy  and 
confidentiality  in  the  early  biomedical  discussions. 

Early  social  science  data  collection  organizations. 
A  sharp  contrast  is  presented  in  the  early  history  of  the 
social  sciences.  Eckler,  a  former  Director  of  the  Census 
Bureau,  reported  that  for  the  first  five  censuses  (1790- 
1830), copies of returns were publicly posted for
corrections or additions of missing information and were
deposited with local courts (1972:164). He also reported
that the sixth census (1840) was the first to instruct
assistant  marshals  (i.e.,  field  enumerators)  that  they  were  to 
"consider  all  communications  made  to  him  in  the 
performance  of  his  duty,  relative  to  the  business  of  the 
people,  as  strictly  confidential"  (Eckler  1972:165).  Eckler 
speculated  that  this  phrase  was  introduced  into  the 
instructions  either  to  deter  or  curtail  the  private  use  of  an 
increased  amount  of  economic  information  collected  in  the 
1840  Census,  or  to  improve  the  reliability  of  reports  to 
enumerators. 

Protecting  the  confidentiality  of  data  was  an  important 
concern  for  two  of  the  early  leaders  of  social  statistics, 
Francis  A.  Walker,  Superintendent  of  the  1870  and  1880 
Censuses,  and  Carroll  D.  Wright,  first  Commissioner  of 
Labor  beginning  in  1885  and  later  Director  of  the  Census. 
Up  until  1902,  the  Census  was  a  temporary  organization 
brought  into  existence  each  decade  by  legislation  and 
terminated  soon  after  issuing  its  reports.  Congresses  during 
the 19th century were indisposed to creating new,
permanent Federal agencies. Walker was appointed
Superintendent of the ninth Census (1870), the plans for
which had become embroiled in larger political issues of
apportionment of House of Representatives seats and black
suffrage (Anderson 1988:76-81). The 40th Congress set
aside plans for a more scientific census proposed by then
Rep. (later President) James A. Garfield and the 1870
Census proceeded under the 1850 Census legislation. In
the ensuing decade, Walker suggested a number of
scientific and operational reforms for the census and a
quinquennial  census  in  1875  (which  never  came  to  pass) 
(Wright  1900:58).  The  1870  Census  was  the  last  census 
that  used  judicial  marshals  appointed  by  the  Senate  to 
supervise  data  collection  in  the  states. 

The  tenth  Census  in  1880  used  "supervisors  of  census" 
appointed  by  the  President  and  who  numbered  more  than 
twice  as  many  as  the  judicial  marshals,  thereby  providing 
more  direct  supervision  of  the  actual  work  of  enumeration 
(Wright 1900:59) and centralized planning and control
(Anderson  1988:99).  Each  enumerator  had  to  make  daily 
reports  and  submit  signed  copies  of  original  data  schedules. 
Most importantly (for our present concern), the enabling
legislation for the 1880 Census provided elementary forms
of protection of confidentiality of the data. First, the oath
of office signed by enumerators required that they "will not
disclose any information contained in the schedules, lists or
statements obtained by me to any person or persons, except
to my superior officers" (Wright 1900:937, Section 7 of the
Act to provide for taking the tenth and subsequent
censuses). Second, Section 12 of the enabling legislation
made it a crime to violate the confidentiality of responses:

"That any supervisor or enumerator, who, having taken
and  subscribed  the  oath  required  by  this  act,  . . .  shall, 
without  the  authority  of  the  Superintendent,  communicate 
to  any  person  not  authorized  to  receive  the  same,  any 
statistics  of  property  or  business  included  in  his  return, 
shall  be  deemed  guilty  of  a  misdemeanor,  and  upon 
conviction  shall  forfeit  a  sum  not  exceeding  five  hundred 
dollars."  (Wright  1900:938) 

In  the  eleventh  Census  (1890)  the  language  about  "any 
statistics  of  property  or  business"  was  changed  to  "any 
information  gained  by  him  in  the  performance  of  his 
duties."  (Wright  1900:946) 

As  Walker's  reforms  proceeded  (including  appointments 
based  on  merit  rather  than  patronage),  the  size  of  the  1880 
Census organization grew, but it ran out of appropriated
funds in 1881. Walker resigned in 1881, moving to the
presidency of M.I.T. After criticism and buffeting by
Congress and the popular press, control of the remaining
Census work and reporting passed to Wright in 1885. The
work was finally completed in 1888, just before legislation
was needed for the eleventh Census (1890).

The  policy  language  about  confidentiality  that  had 
undergone  modest  changes  from  1840  through  1890 
applied  only  to  data  collectors.  Other  Census  employees, 
in  particular,  tabulation  clerks,  and  increasingly  in  1880 
and  1890,  professional  staff,  were  not  similarly  enjoined. 
Thus,  in  the  law  providing  for  the  twelfth  Census  (1900), 
Eckler  reported  that  "confidential  treatment  of  the  census 
records was, for the first time, required of all employees,
and penalties for violation were applicable to everyone"
(1972:165). Similarly, in the law that provided for the
1910  Census,  Eckler  reported  that  "the  possibility  of 
disclosure  through  published  reports"  was  addressed  by 
instructions  in  the  industrial  censuses  that  indicated  that 
publication  was  to  be  made  in  such  a  way  as  "not  to  reveal 
the  report  of  any  establishment"  (1972:165).  This  same 
provision  was  not  extended  to  the  population  and 
agriculture censuses until 1930, presumably because of the
lower  risk  of  identifying  people  than  companies.  For  the 
1920  Census,  data  sharing  with  other  government  officials 
was  strictly  limited  "by  the  provision  that  in  no  case  should 
the  information  thus  furnished  be  used  to  the  detriment  of 
the person to whom it relates" (Eckler 1972:165).

Things  were  more  informal  in  the  Department  of  Labor. 
Plewes  (1985:222)  reported  that  Carroll  Wright 
operationalized  the  standards  for  protecting  confidentiality 
of  data  on  a  personal  basis.  He  sent  telegrams  to 
businessmen,  pledging  his  word  "as  a  government  officer 
that  names  of  your  plants  and  of  city  and  state  in  which 
located shall be concealed" (Plewes 1985:222). Plewes
suggested  that  obtaining  cooperation  for  data  collection 
about  sensitive  topics  like  working  hours  and  conditions, 
child  labor,  and  wage  practices  was  the  motivating  force 
behind  these  personal  persuasions.  These  practices 
eventually  became  associated  with  such  higher  objectives 
as  "integrity,  impartiality  and  independence"  (Plewes 
1985:222).  Plewes  also  noted  that  (as  of  the  date  of  his 
remarks,  March  1985),  the  Bureau  of  Labor  Statistics  was 
one  of  only  two  Federal  statistical  agencies  whose  policies 
of  protecting  confidentiality  have  existed  without  the 
protection of an agency-wide confidentiality statute.

The  expansion  of  the  Federal  government  has  been 
accompanied  by  the  expansion  of  its  information  collection 
role  and  activities.  As  more  and  more  kinds  of  data  have 
been  collected,  issues  surrounding  the  confidentiality  of 
and access to government statistics have also increased. In
many instances, agency practices have been formalized into
statutory protections of confidentiality of statistical data
and prevention of compulsory disclosure. For example,
Title  13  of  the  United  States  Code  governs  the  activities  of 
the  U.S.  Census  Bureau.  In  section  9,  requirements  for  the 
confidentiality  of  census  data  are  spelled  out. 

(a) Neither the Secretary, nor any other officer or
employee of the Department of Commerce or bureau or
agency thereof, or local government census liaison, may,

(1) use the information furnished under the
provisions of this title for any purpose other than the
statistical purposes for which it is supplied; or

(2) make any publication whereby the data furnished
by any particular establishment or individual under this
title can be identified; or

(3) permit anyone other than the sworn officers and
employees of the Department or bureau or agency
thereof to examine the individual reports. (13 USC 9)

Where individual reports are allowed to be shared with
government officials, those records are "immune from legal
process, and shall not, without the consent of the individual
or establishment concerned, be admitted as evidence or
used for any purpose in any action, suit, or other judicial or
administrative proceeding" (13 USC 9(a)(3)).

Microdata from U.S. Department of Justice supported
research also has confidential status and is prohibited from
uses in the legal process other than statistical research:

... no officer or employee of the Federal Government,
and no recipient of assistance under the provisions of this
chapter shall use or reveal any research or statistical
information furnished under this chapter by any person
and identifiable to any specific private person for any
purpose other than the purpose for which it was obtained
in accordance with this chapter. Such information and
copies thereof shall be immune from legal process, and
shall not, without the consent of the person furnishing
such information, be admitted as evidence or used for any
purpose in any action, suit, or other judicial, legislative, or
administrative proceedings. (42 USC 3789g)

Professional association ethical guidelines.
A third source of confidentiality restrictions is the ethical
guidelines of professional associations. The post-Civil War
decades of the 19th century and first two of the 20th century
brought immense technological development, world-changing
scientific discoveries in physics and chemistry,
major demographic changes in American society, and the
development of professions and professional organizations.
The American Statistical Association and the American
Economic Association were front runners in the
movement to lobby for a permanent Census Bureau. These
associations were made up of persons who had prior direct
experience with the censuses or whose graduate students
worked with the Census or with the Department of Labor.
Thus, it is not surprising that ethical guidelines or codes of
professional social science organizations eventually
reflected confidentiality policies. The same people who
were leaders in the associations were also leaders in the
emerging disciplines and professions of the social sciences
and social statistics in which confidentiality policies were
first introduced.


In  general,  professional  associations  are  concerned  with 
promoting  the  professionalism  (and  status)  of  their  work. 
Some  essential  aspects  of  professionalism  are  the  ability  to 
control  or  discipline  members  at  the  fringes  of  respectable 
practice  and  the  provision  of  members  with  resources 
against  outside  disciplinary  or  malpractice  actions. 
Associations  have  developed  codes  of  ethics  that  educate 
members about allowable practices or ethically suspect
practices,  and  that  guide  behavior  in  gray  areas.  Many 
address  the  protection  of  confidentiality  of  sources  or  data. 

The  American  Sociological  Association  requires 
sociologists  to  "take  reasonable  steps  to  ensure  that  records, 
data,  or  information  are  preserved  in  a  confidential 
manner,"  and  that  when  confidential  records,  data  or 
information  are  transferred  to  other  persons  or 
organizations,  "they  obtain  assurances  that  the  recipients  . . . 
will employ measures to protect confidentiality at least
equal  to  those  originally  pledged."  (ASA  1997,  Section 
11.08). 

The  American  Political  Science  Association  addresses  the 
potential conflict between civic and legal obligations to
cooperate  with  governmental  organizations  and  the 
"professional  duty  not  to  divulge  the  identity  of 
confidential  sources  of  information  or  data  developed  in  the 
course of research" (APSA 1998: Section 6). Political
scientists are also required to observe Federal and university
rules and regulations for the protection of human subjects,
including protection of confidentiality of data (APSA 1998:
Section 34).

The  American  Statistical  Association,  founded  in  1839,  has 
recently released a new draft publication, Ethical
Guidelines for Statistical Practice, for comments. The
section  on  ethical  responsibilities  to  research  subjects 
includes  the  following  item:  "Protect  the  privacy  and 
confidentiality  of  research  subjects  and  the  data  they 
provide." 

(American Statistical Association 1998: Section II.D.3 at
http://www.amstat.org/about/ethics.html) 

The American Association for Public Opinion Research also
has a Code of Professional Ethics and Practices policy
pledging confidentiality: "Unless the respondent waives
confidentiality for specified uses, we shall hold as
privileged and confidential all information that might
identify a respondent with his or her responses." (AAPOR
1998: Section II.D.2 at
http://www.aapor.org/ethics/principl.shtml)

Summary. 

The present emphasis on the biomedical roots of modern
human subjects protection regulations and their original
implementation in the Department of Health and Human
Services obscures some important origins of the protection
of confidentiality of records and data. Foundational
documents of ethical principles of biomedical research are
largely silent on issues of privacy and confidentiality. In
contrast, mid-19th century US Census legislation required
enumerators to keep information they collected in the
course of the Census confidential, principally as an
instrumental means to promote subject cooperation and
truthful response. The practice of maintaining
confidentiality  of  Census  data  was  extended  to  all  Census 
employees  in  1900.  Gradually,  professional  associations 
adopted  policies  for  the  protection  of  confidentiality. 
These policies are based not on instrumental values like
improving the cooperation of respondents and accuracy of
the data but on ethical principles like safeguarding the
privacy  of  individuals  and  minimizing  potential  harm  to 
subjects  through  disclosure  of  sensitive  information  to  third 
parties. 

US Census and Bureau of Labor Statistics confidentiality
practices initiated in the late 19th and early 20th centuries
anticipated two of the four major possibilities for failure to
maintain confidentiality. The early statements about
treating information as confidential in the Census enabling
legislation from 1840-1890 and their extension to all
Census employees in 1900 recognized that individuals with
legitimate access to microdata could also behave
illegitimately by selling or transferring data to third parties.
Industrial  census  guidelines  in  1910  and  population  and 
agriculture  census  guidelines  in  1930  recognized  that 
individual  or  microlevel  identities  could  be  deduced  from 
macrolevel  tabular  data  with  small  cell  sizes,  thereby 
reflecting  the  first  concerns  about  statistical  disclosure.  In 
the  next  section  we  describe  four  main  categories  of  failure 
to maintain confidentiality as a preface to describing
activities  undertaken  to  protect  confidentiality  of  archival 
data. 

Ways  That  Confidentiality  Can  Be  Violated 

There are four major ways that confidentiality can be
violated, resulting in the release or deduction of individual
identities and/or identifying characteristics: accidental
release; malicious release; compulsory release; and
statistical  disclosure. 

Accidental  release  may  be  due  to  sloppy  data  management 
procedures,  ignorance  or  errors  on  the  part  of  staff,  or 
failure  to  follow  standard  procedures. 

Malicious  release  may  be  due  to  theft  or  unauthorized 
transfer  of  data  by  disgruntled  staff  or  by  staff  or  others 
seeking  financial  gain,  or  through  breaches  of  computer 
systems  security. 

Compulsory  release  may  occur  as  the  result  of  legal  action 
or  court  order. 

Statistical  disclosure  results  from  logical  use  or  analysis  of 
data  to  identify  cases  or  events  that  are  infrequent  or  rare, 
or  unique  patterns  of  characteristics  which  when  associated 
with  data  from  other  sources,  lead  to  subject  identification. 

The  value  of  these  categories  is  not  merely  descriptive. 
They  also  direct  attention  toward  objects  or  mechanisms  for 
maintaining  confidentiality.  The  idea  of  accidental  release 
suggests that confidentiality is preserved by:

•  educating  staff  about  the  need  for  confidentiality 
protection  procedures; 

•  training  and  monitoring  staff  in  the  application  of 
those  procedures; 

•  performing  quality  control  checks  on  data  files  that 
are  developed  for  restricted  use  or  public  release;  and 

•  maintaining  adequate  security  for  confidential 
information. 

The  central  feature  of  malicious  release  is  the  idea  that 
information  (and  hence  data)  has  value  and  that  there  are 
people,  whatever  their  motives,  who  may  attempt  to 
translate  that  value  into  cash  or  otherwise  use  the 
information  inappropriately.  Disgruntled  staff,  for 
example,  may  satisfy  a  symbolic  urge  for  retaliation  or 
retribution  by  unauthorized  transfer  or  release  of 
information.  Regardless  of  whether  the  motive  is 
instrumental  or  symbolic,  the  inappropriate,  illegal 
behavior  can  be  counteracted  by  deterrence  and 
punishment.  These  dynamics  suggest  that  organizations 
should  have  and  use  policies  that  prohibit  the  unauthorized 
use,  transfer,  or  release  of  data.  In  the  case  of  public 
release,  even  though  there  is  no  restriction  on  who  can 
access  the  available  data,  there  ought  to  be  use  restrictions 
consistent  with  the  research  and  educational  purposes  of  the 
organization. 

The  matter  of  compulsory  release  is  too  complicated  and 
uncertain  to  be  dealt  with  in  an  encapsulated  discussion 
here.  It  is  sufficient  to  note  that  the  ethics  of  research  are 
not  the  only  requirements  that  researchers  face  and  that  the 
legal  protection  accorded  the  confidentiality  of  research 
data  is  not  absolute  or  uniform  across  states  or  in  different 
legal  matters.  Researchers  have  been  ordered  to  release 
confidential  data.  Some  have  complied,  others  have 
refused  and  been  penalized,  still  others  have  had  initial 
orders  overturned  or  modified  on  appeal.  Again,  the  focus 
with  this  type  of  release  seems  to  be  a  strong  organizational 
policy  against  compulsory  release  that  has  as  its  basis  the 
necessity  of  confidentiality  in  social  research.  Where 
possible,  such  policies  should  be  backed  up  by  regulatory 
or  statutory  nondisclosure  protections,  such  as  the  DHHS 
certificates  of  confidentiality  or  the  US  Department  of 
Justice  statutes  (42  USC  3789g)  and  regulations  (28  CFR 
22)  prohibiting  evidentiary  or  other  non-research  uses  of 
justice  research  data. 

The  topic  of  statistical  disclosure  is  also  too  complex  to  be 
dealt  with  in  an  encapsulated  discussion.  But  fortunately, 
there  is  more  information  available  on  this  topic  than  on  the 
others. Statistical disclosure has been the focus of both
professional and academic attention. There are a variety of
established methods for preventing disclosure (Cox et al.
1985; OMB 1994). There is also developmental work in
progress for devising and testing new methods (e.g.,
Duncan undated; Dutta Chowdhury et al. undated). Some
of these have been discussed at this meeting. The central
feature of this approach to preserving confidentiality,
however, is its technical focus on the data themselves.
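
One of the simplest checks of this kind can be sketched in a few lines. The Python fragment below is only an illustration under stated assumptions, not a method taken from the sources cited above: the function name small_cells, the threshold of five, and the table itself are hypothetical. It flags cells of a frequency table whose counts are small enough that a published figure might point to particular individuals.

```python
def small_cells(table, threshold=5):
    """Return the cells of a frequency table with counts from 1 to threshold - 1.

    `table` maps a tuple of category values to a count. The threshold of 5
    is an assumed value chosen for illustration, not a cited standard.
    """
    return {cell: n for cell, n in table.items() if 0 < n < threshold}


# Hypothetical table intended for publication (invented counts).
table = {
    ("County A", "teacher"): 212,
    ("County A", "judge"): 1,
    ("County B", "teacher"): 187,
    ("County B", "judge"): 0,
}

risky = small_cells(table)
print(risky)
```

Cells flagged in this way would be candidates for suppression or for combining with neighboring categories. Empty cells are ignored here, which is a simplification.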

In  general  then,  there  are  four  approaches  on  which  to 
focus  attention  for  protecting  the  confidentiality  of  research 
data: 

•  education  and  training  of  persons  who  work  with 
data; 

•  data  management  techniques  and  statistical 
procedures  that  can  be  applied  to  data; 

•  organizational  policies  that  mandate  confidentiality 
and  data  security;  and 

•  government  regulations  and  laws  that  protect  the 
confidentiality  of  research  data. 

The  next  section  of  this  paper  focuses  attention  on  the  first 
two  of  these  approaches  at  ICPSR. 

Practices  At  ICPSR  To  Assure  Protection  of 
Confidentiality 
Data  modifications. 

These  sections  borrow  heavily  from  the  ICPSR  Guide  To 
Social  Science  Data  Preparation  And  Archiving. 
(Material  taken  from  Second  printing  1997:16-17  is 
italicized.)  Two  kinds  of  variables  often  found  in  social 
science  data  sets  present  problems  that  could  endanger  the 
confidentiality  of  research  subjects.  Most  familiar  are  the 
direct  identifiers  that  may  have  been  obtained  in  the 
process of data collection. These include items such as
names, addresses (including ZIP codes), telephone
numbers (including exchanges), Social Security numbers,
and other linkable identification numbers such as driver
license numbers, certification numbers, etc. Data
collectors should remove all such identifiers when
preparing public use data sets. If data sets are received
with such variables, ICPSR will remove them as part of the
lowest level of study processing. Increasingly,
consideration is being given to returning to investigators
data sets that are received with direct identifiers. This is
because ICPSR practice is to preserve originally submitted
data, and such data could become the focus of legal action
should it be known that ICPSR maintains a copy of such a
data set.
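
As a rough illustration of this lowest level of processing, the following Python sketch drops direct identifiers from a record before release. The variable names and the sample record are hypothetical, and the fragment is not ICPSR's actual processing code.

```python
# Hypothetical names for variables that hold direct identifiers.
DIRECT_IDENTIFIERS = {
    "name", "address", "zip_code", "telephone",
    "ssn", "driver_license", "certification_number",
}


def strip_direct_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without any direct-identifier variables."""
    return {var: value for var, value in record.items() if var not in identifiers}


# Invented record for illustration only.
raw = {
    "name": "Jane Doe",
    "zip_code": "00000",
    "ssn": "000-00-0000",
    "age": 42,
    "q1": 3,
}

public = strip_direct_identifiers(raw)
print(public)
```

The original record is left untouched, so the decision about whether to keep or return the submitted file remains a separate step.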

Another  category  of  variables  can  often  become 
problematic  depending  on  the  content  of  the  data  collection 
and  the  nature  of  the  research  subjects  included  in  the  data 
set. These are indirect identifiers that might be used (in
combination  or  in  conjunction  with  publicly-available 
information)  to  identify  individual  respondents.   This 
category  is  harder  to  deal  with,  since  it  includes  items  that 
are  often  the  focus  of  or  useful  for  statistical  analysis.  That 
is  probably  why  such  information  was  collected  in  the  first 
place.  Some  examples  of  these  indirect  identifiers  are 
detailed  geography  (e.g.,  state,  county,  or  Census  tract  of 
residence),  organizations  to  which  the  respondent  belongs, 
educational  institution  from  which  the  respondent 
graduated  (and  year  of  graduation),  exact  occupations 
held,  place  where  the  respondent  grew  up,  exact  dates  of 
events,  detailed  income,  and  offices  or  posts  held  by  the 
respondent.  Such  indicators  should  be  reviewed  by  the 
principal  investigator/data  collector  and  a  judgment  made 
about  the  effect  of  retaining  such  items  upon  the 
confidentiality  of  the  research  subjects  before  depositing 
the  data  in  a  public  archive. 

Sometimes,  variables  usually  considered  to  be  indirect 
identifiers  can  become  direct  identifiers  depending  upon 
features  of  the  research  design.  Job  title  or  occupational 
role  can  directly  identify  a  respondent  when  there  is  only 
one  such  position  in  an  organization,  one  such  organization 
in  a  town  (or  department  in  an  organization),  and  the  town 
(or  organization)  is  identified,  as  well  as  the  date  of  the  data 
collection.  For  example,  if  the  police  chief,  Presbyterian 
minister,  high  school  principal,  or  any  other  unique  figure 
in  a  community  or  organization  identifies  their  job  title  or 
occupational  role,  and  the  community  or  organization  is 
also  identified,  and  the  date  of  the  data  collection  is  known, 
then  it  is  easy  to  find  out  exactly  who  that  person  was  at 
that  time. 
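
The situation just described, in which a combination of ordinary variables singles out one person, can be checked mechanically. The sketch below is illustrative only; the function name, variable names, and records are invented. It lists respondents whose combination of values on a chosen set of indirect identifiers occurs only once in the file.

```python
from collections import Counter


def unique_combinations(records, indirect_identifiers):
    """Return the records whose combination of indirect identifiers is unique."""
    def key(record):
        return tuple(record[var] for var in indirect_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]


# Invented respondents for illustration only.
respondents = [
    {"town": "Smallville", "job_title": "police chief", "year": 1998},
    {"town": "Smallville", "job_title": "teacher", "year": 1998},
    {"town": "Smallville", "job_title": "teacher", "year": 1998},
]

flagged = unique_combinations(respondents, ["town", "job_title", "year"])
print(flagged)
```

Here only the police chief is flagged, since the two teachers share the same combination of values.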

Handling  indirect  identifiers. 

If,  in  the  judgment  of  the  principal  investigator,  a  variable 
might  act  as  an  indirect  identifier  (and  thus  could  be  used 
to  compromise  the  confidentiality  of  a  research  subject), 
the  investigator  should  "treat"  that  variable  when 
preparing  a  public  use  data  set.  Modifications  commonly 
used  are: 

•  removal — eliminating  the  variable  from  the  data  set 
entirely; 

•  bracketing — combining  the  categories  of  a  variable; 

•  top-coding — grouping  the  upper  range  of  a  variable 
to  eliminate  outliers; 

•  collapsing  and/or  combining  variables — merging 
the  concepts  embodied  in  two  or  more  variables  by 
creating  a  new  summary  variable. 

The  following  example  is  taken  from  the  ICPSR  Guide  To 
Social  Science  Data  Preparation  And  Archiving  (1997:17). 
An example from a national survey of physicians
(containing many details of each doctor's practice patterns,
background,  and  personal  characteristics)  may  help  to 
illustrate  each  of  these  categories  of  treatment  of  variables 
to  protect  confidentiality.   Variables  identifying  the  school 
from  which  the  medical  degree  was  obtained  and  the  year 
graduated  should  probably  be  removed  entirely,  due  to  the 
ubiquity  of  publicly  available  rosters  of  college  and 
university  graduates.  The  state  of  residence  of  the 
physician could be bracketed into a new "Region" variable
(substituting more general geographic categories such as
"East," "South," "Midwest," and "West"). The upper end
of the range of the "physician's income" variable could be
top-coded (e.g., "$150,000 or more") to avoid identifying
the most highly paid individuals. Finally, a series of
variables documenting the responding physician's
certification in several medical specialties could be
collapsed to a summary indicator (with new categories
such as "Surgery," "Pediatrics," "Internal Medicine,"
"Two or more specialties," etc.).

ICPSR  staff  consult  with  principal  investigators  to  help 
them  design  or  modify  a  public  use  data  set  that  maintains 
(to  the  maximum  degree  possible)  the  confidentiality  of 
respondents.   The  staff  will  additionally  perform  an 
independent  confidentiality  review  of  data  sets  submitted  to 
the  archive  and  will  work  with  the  investigators  to  resolve 
any  remaining  problems  of  confidentiality.  The  goal  of  this 
cooperative  approach  is  to  ensure  that  all  reasonable  steps 
have  been  taken  to  protect  the  confidentiality  of  research 
respondents whose information is contained in ICPSR's
public  use  data  sets. 
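
The four treatments illustrated by the physician example above can be sketched together in Python. The variable names, the partial state-to-region table, and the sample record are hypothetical; the $150,000 ceiling follows the example quoted from the Guide. The fragment is meant only to make the operations concrete and does not represent ICPSR's processing software.

```python
# Hypothetical variable names and a deliberately partial state-to-region table.
REGION_OF_STATE = {"NY": "East", "GA": "South", "OH": "Midwest", "CA": "West"}
CERTIFICATIONS = {
    "cert_surgery": "Surgery",
    "cert_pediatrics": "Pediatrics",
    "cert_internal_medicine": "Internal Medicine",
}
INCOME_CEILING = 150_000


def treat(record):
    """Return a copy of one survey record with indirect identifiers treated."""
    out = dict(record)

    # Removal: drop school and graduation year entirely.
    out.pop("medical_school", None)
    out.pop("graduation_year", None)

    # Bracketing: replace state of residence with a broader region.
    out["region"] = REGION_OF_STATE.get(out.pop("state", None), "Unknown")

    # Top-coding: the ceiling value stands for "$150,000 or more".
    out["income"] = min(out["income"], INCOME_CEILING)

    # Collapsing: merge several certification flags into one summary variable.
    held = [label for var, label in CERTIFICATIONS.items() if record.get(var)]
    for var in CERTIFICATIONS:
        out.pop(var, None)
    if len(held) >= 2:
        out["specialty"] = "Two or more specialties"
    elif held:
        out["specialty"] = held[0]
    else:
        out["specialty"] = "None reported"
    return out


# Invented record for illustration only.
physician = {
    "state": "OH",
    "medical_school": "Example University",
    "graduation_year": 1975,
    "income": 310_000,
    "cert_surgery": True,
    "cert_pediatrics": False,
    "cert_internal_medicine": True,
    "patients_per_week": 90,
}

treated = treat(physician)
print(treated)
```

Variables that are not indirect identifiers, such as the hypothetical patients_per_week, pass through unchanged.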

Research Trends that Pose Problems for Confidentiality

Some  types  of  studies  include  variables  that  pose  unusually 
difficult  or  problematic  threats  to  confidentiality  but  are 
also  difficult  to  modify  because  of  their  central  importance 
to the study. One such study is the multi-level study having
hierarchical  files  with  linkage  variables  between  files. 
Another  type  is  the  study  that  has  exact  event  dates  and 
birth  dates.  A  third  type  is  the  study  with  geo-coded 
information.  A  fourth  type  is  the  qualitative  narrative 
interview  study.  A  fifth  type,  the  longitudinal  panel  study, 
is  not  especially  problematic  when  ready  for  archiving,  but 
the  need  to  maintain  linkage  and  locator  identifiers  from 
one  round  to  the  next  makes  the  study  vulnerable  to  threats 
to  confidentiality  during  its  operational  phases. 

Multi-level studies, where data are collected about places,
organizations, households, persons, and events
simultaneously, are especially difficult to handle with the
usual means of modifying variables. Often, information in
the  multiple  levels  of  files  will  make  it  easy  to  identify 
individual  subjects,  but  the  linkage  variables  between  files 
are  essential  to  maintain  the  multi-level  value  of  the  study. 
Where  identification  risks  are  high  because  the  multiple 
levels  of  information  make  it  easy  to  narrow  the  focus  on 
individuals,  ICPSR  will  consider  making  the  study  a 
restricted use data set.

Studies  with  many  precisely  dated  events  and  birth  dates 
also  pose  risks  to  confidentiality,  especially  if  the  event 
information  also  might  have  been  publicized  in  the  media 
or  recorded  in  publicly  available  administrative  records 
(e.g.,  court  dockets).  Exact  dates  in  the  study  information 
and  event  characteristics  can  be  matched  against  media  or 
administrative  record  data  allowing  subjects  to  be  easily 
identified.  Nevertheless,  the  exact  date  information  is  often 
useful  for  various  forms  of  time  dependent  analyses  like 
survival  analysis  or  event  history  analysis.  Removing  exact 
dates  reduces  the  value  of  the  information.  Once  again,  the 
solution  may  be  creating  a  restricted  data  set  rather  than 
removing  information. 

Studies  with  geo-coded  information  are  also  problematic. 
Depending  upon  the  nature  of  other  information  in  the 
study  and  the  degree  of  area  resolution,  geo-coded  studies 
may  make  it  easy  to  identify  subjects,  especially  when 
public  information  is  available.  For  example,  it  would  be 
inappropriate,  unethical,  and  potentially  dangerous  to 
release  a  data  set  with  the  address  locations  of  rape  victims. 
Again,  resolving  these  kinds  of  problems  caused  by 
multiple  levels  of  information  is  not  a  simple  process  of 
modifying  indirect  identifiers  because  of  the  nature  of  the 
study. 

Qualitative  narrative  interviews  are  another  type  of 
problematic study. The level of detail provided through
in-depth interviews is extensive and often contains many
references to people, places, events, associations,
organizations, family relationships, persons not liked at
work, and so forth. Someone with intimate knowledge of
these patterns of information may be able to easily identify
the individuals involved. The very richness of the detailed
information is simultaneously the value of the study and the
threat to confidentiality. Original investigators are loath
to restrict the richness of the narratives, yet are unwilling to
release  such  detailed  information  because  of  the  ease  of 
identifying  individuals  involved  in  the  scenes. 

Providing Access To Original Indirect Identifiers

It  is  rarely  the  case  that  variables  removed  or  modified  to 
maintain  confidentiality  are  without  value  for  research 
purposes.  Archives  and  other  data  providers,  therefore, 
frequently  field  requests  for  some  form  of  access  to  original 
data  values.  Three  of  these  forms  of  access  that  have  been 
utilized  will  be  discussed  here:  customized  data  analysis 
performed  by  the  archive/data  provider;  private  use  data 
sets;  and  front-end  software. 

The  first  method  of  providing  access  to  restricted  indirect 
identifiers  retains  the  data  in  secure  form  but  permits 
researchers  to  design  analyses  that  use  those  data. 
Customized  data  analysis  (often  performed  at  cost  to  the 
researcher) affords the opportunity of obtaining analytic
results from restricted variables. Typically, researchers
will  be  asked  to  provide  detailed  analytic  instructions — 
usually  in  the  form  of  software  commands — and  the 
requested  analyses  are  performed  at  the  archive,  with 
analytic  output  sent  to  the  requesting  party.  At  ICPSR  and 
elsewhere,  the  output  is  examined  by  staff  to  ensure  that  the 
analysis  results  will  not  endanger  the  confidentiality  of 
respondents.  Delivery  of  a  private-use  data  set  allows 
original  data  values  to  be  provided  to  a  researcher,  with  the 
requestor  explicitly  assuming  responsibility  for  maintaining 
confidentiality  of  those  data.  Most  organizations  that 
provide  private-use  data  sets  require  a  transaction  form, 
replete  with  both  researcher  and  official  signatures 
certifying  that  such  data  will  be  securely  held,  to  be  used 
only  by  the  requesting  party  in  ways  that  protect  respondent 
confidentiality.  A  third  mechanism  bundles  an  entire  data 
set  in  an  analytic  software  package  which  prevents 
examination  of  discrete  values/cases  while  allowing 
statistical  access  to  all  variables.  This  front-end  software 
alternative  usually  prevents  extracting  or  downloading  of 
original  values  on  some  or  all  variables.  (The  National 
Center  for  Education  Statistics'  Data  Analysis  System 
[DAS]  is  one  example  of  such  a  software-based  method  of 
protecting  the  confidentiality  of  research  subjects.  Other 
such  front-ends  are  actively  being  explored,  including  at 
ICPSR.) 
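
As a rough illustration of the front-end idea described above, the following Python sketch wraps a small data set so that a caller can request group counts and means but is given no method for reading individual records. The class name `FrontEnd`, the `min_cell` threshold of 5, and the sample variables are all assumptions made for this example; it does not describe DAS or any ICPSR system.

```python
from collections import defaultdict
from statistics import mean


class FrontEnd:
    """Hypothetical wrapper: aggregate access only, no case-level access."""

    def __init__(self, records, min_cell=5):
        # Keep a private copy; no public method returns these rows.
        self.__records = [dict(r) for r in records]
        self.min_cell = min_cell  # assumed disclosure threshold

    def group_means(self, group_var, value_var):
        """Return {group: (n, mean)}; groups below min_cell are suppressed."""
        groups = defaultdict(list)
        for row in self.__records:
            groups[row[group_var]].append(row[value_var])
        result = {}
        for key, values in groups.items():
            if len(values) < self.min_cell:
                result[key] = None  # suppressed: too few cases
            else:
                result[key] = (len(values), mean(values))
        return result


# Illustrative, made-up data: region plays the role of an indirect identifier.
sample = (
    [{"region": "north", "income": 30 + i} for i in range(6)]
    + [{"region": "south", "income": 50 + i} for i in range(2)]
)
fe = FrontEnd(sample)
summary = fe.group_means("region", "income")
```

Here the six-case group is summarized while the two-case group is suppressed. The privacy of the stored rows rests only on a Python naming convention, so this sketch shows the shape of the interface rather than a secure implementation.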

Each  of  the  mechanisms  described  above  has  advantages  as 
well  as  drawbacks.  None  are  completely  satisfactory  to 
both  the  research  community  and  the  repository/holder  of
original  data.  Tightest  control  of  original  data  values  is  an 
attraction  of  the  customized  data  analysis  option,  but  is  the 
least  popular  with  active  researchers.  It  is  typically  costly 
(in  terms  of  both  time  and  money),  and  frequently  thwarts 
the  iterative  analytic  style  most  common  in  the  social 
sciences.  Private-use  data  sets  permit  the  most  researcher 
control  of  the  analytic  process,  at  the  expense  of  certainty 
of  protection  of  respondent  confidentiality.  Enforcement  of
private-use  data  set  provisions  agreed  to  by  requestors  is 
difficult  to  effect,  and  sanctions  against  violators  of 
promised  assurances  would  inevitably  involve  a  litigious 
voyage  on  mostly-uncharted  waters.  Possibly  the  most 
secure  yet  flexible  alternative  is  the  front-end  software 
option.  Yet  from  the  archive's  standpoint,  this  is  probably 
the  most  expensive  of  the  three  alternatives;  putting  data 
into  one  of  these  packages  is  so  time-consuming  that  it  can 
practicably  be  utilized  on  very  few  data  collections.
Furthermore,  it  is  doubtful  that  front-end  software  is  wholly 
impervious  to  hacking  by  a  skilled  and  determined  violator. 
Finally,  the  learning  of  "yet  another"  software  package  and 
its  guaranteed  limitations  raises  the  bar  over  which 
interested  researchers  must  jump  to  access  needed  research 
data. 

Other  Alternatives  for  Protecting  Confidentiality 

Yet  other  mechanisms  have  been  proposed  or  are  being 
experimented  with  in  the  quest  for  the  "ideal"  way  of 
protecting  respondent  confidentiality.  Brief  mention  will
be  made  of  three  "positive"  alternatives,  before  we  close 
this  section  on  a  draconian  note.  Licensing  a  researcher  to 
use  a  data  set  containing  indirect  identifiers  is  a  variant  on 
the  private-use  data  set  arrangement  described  above.  Like 
it,  a  licensed  use  is  agreed  to  after  completion  of  a 
transaction  form.  Unlike  private-use  data  set  agreements, 
however,  most  licenses  impose  an  up-front  fee  in  the  form 
of  a  security  bond  as  surety  for  maintaining  confidentiality. 
The  fee  has  been  known  to  range  from  a  few  hundred  to 
many  thousands  of  dollars.  Several  license  mechanisms 
also  require  the  researcher  and  her/his  institution  to  assume 
all  legal  liability  in  any  instance  of  breaching 
confidentiality.  Needless  to  say,  the  popularity  of  this  form 
of  "access-with-assurance"  is  quite  low  in  the  research 
community  (not  to  mention  in  the  college/university  legal 
offices). 

A  second  alternative  method  is  being  discussed  in  more 
detail  elsewhere  at  this  conference,  and  so  will  be  briefly 
alluded  to  here.  This  is  the  "perturbing"  of  original  data 
values  to  break  the  certain  bond  between  any  given  data 
value  and  the  (possibly  identifiable)  individual  who  may 
have  provided  the  initial  information.  Since  the  essence  of 
this  technique  is  the  altering  of  original  data  values,  it 
remains  suspect  in  the  minds  of  several  generations  of 
social  scientists.  These  individuals  find  it  difficult  to 
overcome  one  legacy  of  their  training — getting  error  out  of 
research  data  collections — which  clashes  with  the  practice 
of  introducing  error  into  a  data  set  (however  noble  the 
purpose  underlying  that  introduction). 
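
To make the idea of perturbation concrete, here is a minimal Python sketch that adds zero-mean random noise to each value in a list. The helper name `perturb`, the Gaussian noise model, the `scale` value, and the income figures are assumptions chosen for illustration; the methods discussed at the conference may differ substantially.

```python
import random
from statistics import mean


def perturb(values, scale, seed=None):
    """Add zero-mean Gaussian noise to each value (hypothetical helper)."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]


# Made-up incomes, in thousands of dollars.
original = [21.0, 35.5, 48.0, 52.5, 67.0, 74.5, 88.0, 93.5]
released = perturb(original, scale=2.0, seed=1)

# Individual values no longer match the originals, while the mean
# of the released values stays close to the original mean.
changed = sum(1 for a, b in zip(original, released) if a != b)
mean_shift = abs(mean(released) - mean(original))
```

Because the noise has mean zero, aggregate statistics such as the mean are expected to move only a little, while no released value can be tied with certainty to what a respondent reported. The cost is exactly the one noted above: error has been deliberately introduced into the data.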

Perhaps  more  promising  is  the  concept  of  secure  data 
analysis  laboratories.  In  such  facilities,  original  data 
would  be  available  for  data  analysis  in  a  controlled  setting, 
precluding  such  things  as  making  copies  of  original  data, 
investigating  single  cases,  or  transmitting  the  data  offsite. 
Scholars  would  apply  to  visit  the  site  to  do  data  analysis  in 
the  laboratory  under  secure  conditions.  An  experiment 
using  this  form  of  access  can  be  found  at  Carnegie  Mellon 
University,  for  its  Violence  Research  Consortium  Project 
supported  by  the  National  Science  Foundation  and  the 
National  Institute  of  Justice.  Data  from  the  National  Crime
Victimization  Survey,  which  have  long  been  distributed 
without  geographic  sector  information,  are  available  with 
geographic  information  at  Carnegie  Mellon  to  the  violence 
consortium  members.  This  mechanism  represents,  for
social  scientists,  a  departure  from  a  long-term  trend  of
facilitating  the  export  of  research  data  from  an  archive  or
producer  site  directly  to  the  institution  (or  desktop!!)  of  the
interested  scholar.  It  should  be  noted  parenthetically  that 
many  research  materials  utilized  by  both  historians  and 
social  scientists  are  available  only  by  visiting  the  site  where 
the  research  materials  are  housed.  Included  among  such 
facilities  are  traditional  archives  and  other  repositories, 
including  some  fine  social  science  collections  like  those  of 
the  Henry  Murray  Center  at  Radcliffe  College. 


Undoubtedly  more  costly  for  the  individual  researcher  (and
perhaps  for  the  archive  as  well),  this  mode  of  access  to 
confidential  data  may  become  more  common  with 
heightened  concern  for  preserving  confidentiality. 

The  search  for  suitable  mechanisms  for  protecting 
confidential  microdata  promises  to  become  a  high-stakes 
venture.  At  risk  is  the  Big  Kahuna  of  post-WWII  social 
scientific  research  practice — readily  available,  empirical 
microdata.  Some  in  the  statistical  and  social  science 
communities,  as  well  as  in  government,  are  beginning  to 
worry  about  the  release  of  any  microdata,  with  a  few  even 
predicting  its  demise. 

Conclusion 

The  very  progress  of  social  science  research  methodology 
has  made  it  more  difficult  to  safeguard  the  confidentiality 
of  the  research  data.  Removing  direct  identifiers  is  a 
foundational  requirement  for  public  use  data  sets  but  that  is 
essentially  a  trivial  task.  More  difficult  tasks  involve 
investigating  which  variables  could  be  used  as  indirect 
identifiers  and  modifying  them  without  significantly 
reducing  the  value  of  the  data  collection.  Careful  attention 
must  be  paid  to  interactions  among  the  context  of  the  study, 
the  nature  of  the  sample,  and  the  characteristics  of 
respondents  to  prevent  ordinarily  unrevealing  information 
from  becoming  the  pointer  to  an  individual.  But  many 
studies  today  involve  complex  research  designs  with 
multiple  levels  of  data  collection,  file  linkage  variables  that 
are  crucial  to  the  statistical  analysis,  sources  of  information 
that  are  intrinsically  locational  in  nature,  or  detailed 
descriptions  of  events  or  situations  that  can  be
cross-referenced  in  publicly  available  sources  like  the  media  or
administrative  records.  Maintaining  complete  archival  files 
for  these  kinds  of  studies  may  involve  other  procedures 
than  simply  eliminating  or  modifying  variables. 
Procedures  used  in  the  past  or  under  development  include: 

•  conducting  contracted  analyses;

•  creating  private-use  data  sets;

•  developing  front-end  software  to  limit  access  to  data
records;

•  licensing  data  use;

•  introducing  noise  (known  statistical  error)  into  data
records;

•  developing  data  laboratories  from  which  the  data
cannot  be  removed.